Context-Aware Wrapping: Synchronized Data Extraction

Authors

  • Shui-Lung Chuang
  • Kevin Chen-Chuan Chang
  • ChengXiang Zhai
Abstract

The deep Web presents a pressing need for integrating large numbers of dynamically evolving data sources. To be more automatic yet accurate in building an integration system, we observe two problems: First, across sequential tasks in integration, how can a wrapper (as an extraction task) consider the peer sources to facilitate the subsequent matching task? Second, across parallel sources, how can a wrapper leverage the peer wrappers or domain rules to enhance extraction accuracy? These issues, while seemingly unrelated, both boil down to the lack of “context awareness”: Current automatic wrapper induction approaches generate a wrapper for one source at a time, in isolation, and thus inherently lack the awareness of the peer sources or domain knowledge in the context of integration. We propose the concept of context-aware wrappers that are amenable to matching and that can leverage peer wrappers or prior domain knowledge. Such context awareness inspires a synchronization framework to construct wrappers consistently and collaboratively across their mutual context. We draw the insight from turbo codes and develop the turbo syncer to interconnect extraction with matching, which together achieve context awareness in wrapping. Our experiments show that the turbo syncer can, on the one hand, enhance extraction consistency and thus increase matching accuracy (from 17-83% to 78-94% in F-measure) and, on the other hand, incorporate peer wrappers and domain knowledge seamlessly to reduce extraction errors (from 9-60% to 1-11%).
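For readers unfamiliar with the F-measure used to report matching accuracy, the following is a minimal sketch of how such a score is commonly computed. It assumes the balanced F1 variant (the abstract does not say which variant was used) and represents a matching result as a set of attribute pairs; the function names and the sample attribute pairs are hypothetical and are not taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def match_scores(predicted: set, gold: set) -> tuple:
    """Return (precision, recall, F1) for predicted vs. gold match pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical attribute matches between two sources (illustrative only).
predicted = {("title", "name"), ("author", "writer"), ("price", "isbn")}
gold = {("title", "name"), ("author", "writer"),
        ("price", "cost"), ("year", "date")}

precision, recall, f1 = match_scores(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Because the harmonic mean is pulled toward the lower of the two values, a matcher cannot reach a high F-measure by trading recall for precision or the reverse.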

Similar resources

Context-aware Modeling for Spatio-temporal Data Transmitted from a Wireless Body Sensor Network

Context-aware systems must be interoperable and work across different platforms at any time and in any place. Context data collected from wireless body area networks (WBAN) may be heterogeneous and imperfect, which makes their design and implementation difficult. In this research, we introduce a model which takes the dynamic nature of a context-aware system into consideration. This model is con...

Integration Issues of an Ontology Based Context Modelling Approach

In this paper we analyse the applicability of our ontology based context modelling approach, considering a range of use cases. After wrapping up the model and the Context Ontology Language (CoOL) derived from it, we introduce some interesting applications of the language, based on a scenario showing the challenges in context aware service interactions. We focus on two submodels of our model for...

Applications of a Context Ontology Language

In this paper we analyse the applicability of our Context Ontology Language (CoOL), considering a range of use cases. After wrapping up the model in use within this language, we introduce some interesting applications of the language, based on a scenario showing the challenges in context aware service interactions. We focus on two submodels of our model for context aware service interactions, n...

Context-aware systems: concept, functions and applications in digital libraries

Background and Aim Among the places that context-aware systems and services would be very useful, are libraries. The purpose of this study is to achieve a coherent definition of context aware systems and applications, especially in digital libraries. Method: This was a review article that was conducted by using Library method by searching articles and e-books on websites and databases. Results:...

Unstructured Data Integration through Automata-Driven Information Extraction

Extracting information from plain text and restructuring them into relational databases raise a challenge as how to locate relevant information and update database records accordingly. In this paper, we propose a wrapper to efficiently extract information from unstructured documents, containing plain text expressed with natural-like language. Our extraction approach is based on the automata for...

Publication date: 2007